Creating musical features using multi-faceted, multi-task encoders based on transformers

Computational machine intelligence approaches have enabled a variety of music-centric technologies in support of creating, sharing and interacting with music content. A strong performance on specific downstream application tasks, such as music genre detection and music emotion recognition, is paramount to ensuring broad capabilities for computational music understanding and Music Information Retrieval. Traditional approaches have relied on supervised learning to train models to support these music-related tasks. However, such approaches require copious annotated data and still may only provide insight into one view of music—namely, that related to the specific task at hand. We present a new model for generating audio-musical features that support music understanding, leveraging self-supervision and cross-domain learning. After pre-training using masked reconstruction of musical input features using self-attention bidirectional transformers, output representations are fine-tuned using several downstream music understanding tasks. Results show that the features generated by our multi-faceted, multi-task, music transformer model, which we call M3BERT, tend to outperform other audio and music embeddings on several diverse music-related tasks, indicating the potential of self-supervised and semi-supervised learning approaches toward a more generalized and robust computational approach to modeling music. Our work can offer a starting point for many music-related modeling tasks, with potential applications in learning deep representations and enabling robust technology applications.


Related work
Transformer models. In the past few years, pre-trained models and self-supervised representation learning have yielded great success on NLP tasks. Many self-supervised pre-trained models based on multi-layer self-attention transformers 17, such as BERT 18, GPT 19, XLNet 12, and Electra 20, have been used effectively. BERT is perhaps the most popular model due to its simplicity and outstanding performance across a variety of tasks. BERT reconstructs masked input sequences in its pre-training stage; through reconstruction, the model learns a powerful contextual representation of its input. More recently, the success of BERT in NLP has drawn attention from researchers in acoustic signal processing. Some pioneering works 7-10, 21, 22 have shown the effectiveness of adapting BERT and other self-supervised approaches to Automatic Speech Recognition (ASR). By designing pre-training objectives specific to the audio modality, it is possible to adapt BERT-like models to music and other audio domains. In vq-wav2vec 21, input speech audio is first discretized to a K-way quantized embedding space by learning discrete representations from audio samples. However, the quantization process requires heavy computing resources and runs counter to the continuous nature of acoustic frames. Other works 7-10, 23 have designed modified versions of BERT that directly utilize continuous speech. In some works 7, 8, 23, continuous frame-level masked reconstruction was adapted in a BERT-like pre-training stage. In other work 10, SpecAugment 24 was applied to mask input frames, and another method 7 learned by reconstruction after shuffling acoustic frame orders rather than masking frames.
Within the MIR realm, representation learning has been popular for many years. Several convolutional neural network (CNN)-based supervised methods 3-6, 25 have been proposed for various music understanding tasks. These usually employ convolutional layers on Mel-spectrogram-based representations or raw waveform signals of music audio to learn effective music representations, and append fully connected layers to predict relevant annotations such as music genres or moods. However, training CNN-based models usually requires large datasets with reliable and consistent human-annotated labels. Other music representations have used contrastive learning 26-29 for generating audio embeddings for downstream tasks. Carmon 30 and Hendrycks 31 have shown that using self-supervision on unlabeled data can significantly improve model robustness. More recently, self-attention transformers have shown promising results in music generation. For example, the Music Transformer 32 and Pop Music Transformer 33 employed relative attention to capture long-term structure from music MIDI data; however, compared with raw music audio, the size of existing MIDI datasets is limited. Transcription from raw audio to MIDI files is time-consuming and often inaccurate, necessitating a transformer system that accepts (continuous) audio input. Other works have investigated lowering the computational cost of using transformers, potentially enabling greater model complexity and modeling capacity 28.
Multi-task learning. Multi-task learning (MTL) is an approach that involves assigning several tasks to a model to train on simultaneously 34 . This approach has been used to great extent in several music-related tasks, including frequency estimation 35 , source separation 36 and instrument detection 37 . It is common for multitask systems to favor well-represented tasks, sometimes at the expense of under-represented tasks 38 , and some research has attempted to ameliorate this problem 39,40 . As far as the authors know, self-supervised representations in music have not been fine-tuned on multiple music tasks, let alone tasks that span regression and classification. Ideally, musical features that show utility on several downstream music tasks simultaneously would be highly desirable for music research, providing a "one stop shop" to researchers attempting various tasks related to music understanding and MIR.
In this work, we propose M3BERT, a universal music-acoustic encoder based on transformers and multi-task learning. M3BERT is first pre-trained on large amounts of unlabeled music datasets, and then fine-tuned using an MTL approach on specific downstream music annotation tasks using labeled data.

M3BERT model
A universal transformer-based encoder named M3BERT is presented for music representation learning. The system overview of the proposed M3BERT model is shown in Fig. 1.
Pre-training and training. The main idea of masked reconstruction pre-training is to perturb inputs by randomly masking tokens with some probability and then using the model to reconstruct these masked tokens at the output. Intuitively, this is similar to dropout 42, in which certain features or layers in a neural network are set to zero in order to prevent overfitting. In the pre-training process, a reconstruction module, which consists of two feed-forward layers with GeLU activation 43 and layer-normalization 44, is appended to the encoder-decoder architecture to predict the masked inputs. The multi-task system then uses the output of the last M3BERT encoder layer as its input. For clarity, we call M2BERT the transformer component of the overall model; M3BERT refers to the transformer with the additional multi-task layer of enrichment. Several masking policies are presented for enabling M3BERT to learn music representations.
Masking policy 1: contiguous frame masking (CFM). To prevent the model from exploiting local smoothness of acoustic frames, we mask spans of consecutive frames dynamically. Given a sequence of input frames $X = (x_1, x_2, \ldots, x_n)$, we select a subset $Y \subset X$ by iteratively sampling contiguous input frames (spans) until the masking budget (in this case, 15% of $X$) has been spent. At each iteration, a span length is first sampled from the geometric distribution $l \sim \mathrm{Geo}(p)$. Then, the starting point of the masked span is randomly selected. We set $p = 0.2$, $l_{min} = 2$ and $l_{max} = 7$. The corresponding mean span length is around 3.87 frames (179.6 ms). Other schemes were also tried (variable lengths with different averages, constant lengths, etc.), but this scheme yielded the highest performance on downstream tasks. In each masked span, the frames are masked according to the following policy:
(1) With 70% probability, replace all frames with zero. Since each dimension of input frames is normalized to have zero mean, setting the masked value to zero is equivalent to setting it equal to the mean.
(2) With 20% probability, replace all frames with a random masking frame (mutually exclusive with case 1).
(3) In the remaining cases (10% of the time), keep the original frames unchanged. Since M3BERT will only receive acoustic frames without masking during inference time, this policy allows the model to receive real inputs during pre-training, resolving the pre-train/fine-tune inconsistency problem 18.
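The span sampling and the three-way replacement rule of CFM can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that span lengths follow a geometric distribution truncated and renormalized to the range 2 to 7 (this reading reproduces the stated mean of about 3.87 frames), that the "random masking frame" is a Gaussian noise vector, that overlapping spans are resampled, and that the last span is trimmed to fit the budget. All function names are hypothetical.

```python
import random


def span_length_weights(p=0.2, l_min=2, l_max=7):
    """Geometric weights p(1-p)^(l-1), restricted to l_min..l_max."""
    lengths = list(range(l_min, l_max + 1))
    weights = [p * (1 - p) ** (l - 1) for l in lengths]
    return lengths, weights


def mean_span_length(p=0.2, l_min=2, l_max=7):
    """Mean of the truncated (renormalized) geometric distribution."""
    lengths, weights = span_length_weights(p, l_min, l_max)
    return sum(l * w for l, w in zip(lengths, weights)) / sum(weights)


def contiguous_frame_mask(frames, budget=0.15, rng=None):
    """Return (perturbed copy of frames, sorted indices of selected frames)."""
    rng = rng or random.Random()
    n, dim = len(frames), len(frames[0])
    target = round(budget * n)              # number of frames to select
    lengths, weights = span_length_weights()
    out = [list(f) for f in frames]
    selected = set()
    while len(selected) < target:
        length = rng.choices(lengths, weights=weights, k=1)[0]
        # Assumption: trim the final span so the budget is not exceeded.
        length = min(length, target - len(selected), n)
        start = rng.randrange(n - length + 1)
        span = range(start, start + length)
        if any(i in selected for i in span):
            continue                        # assumption: resample overlaps
        selected.update(span)
        u = rng.random()
        if u < 0.7:                         # (1) zero out the whole span
            for i in span:
                out[i] = [0.0] * dim
        elif u < 0.9:                       # (2) one random frame for the span
            noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for i in span:
                out[i] = list(noise)
        # (3) otherwise the span is left unchanged
    return out, sorted(selected)
```

Calling `mean_span_length()` with the default parameters gives roughly 3.87, which matches the value quoted in the text under the truncation assumption.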
Masking policy 2: contiguous channel masking (CCM). The intuition of channel masking is that a model that can predict the partial loss of channel information has learned a high-level representation of such channels. For log-mel spectrum and log-CQT features, a block of consecutive channels is randomly masked to zero for all time steps across the input sequence of frames. Specifically, the number of masked channels, $c$, is first sampled uniformly from $\{1, \ldots, H\}$, where $H$ is the number of total channels (in our case, this is 272). Then a starting channel index $h$ is sampled uniformly from $\{1, \ldots, H - c\}$ and the channels from $h$ to $h + c$ are masked.
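A short sketch of CCM is given below. It assumes 0-based indexing and a half-open block of exactly `c` channels starting at `h`, which is one reasonable reading of the description; the function name is hypothetical.

```python
import random


def contiguous_channel_mask(frames, rng=None):
    """Zero one block of consecutive channels in every frame.

    frames: list of per-frame feature vectors with H channels each.
    Returns (masked copy, start index h, block width c).
    """
    rng = rng or random.Random()
    num_channels = len(frames[0])            # H in the text
    c = rng.randint(1, num_channels)         # number of masked channels
    h = rng.randint(0, num_channels - c)     # 0-based start (assumption)
    out = [list(f) for f in frames]
    for frame in out:
        for ch in range(h, h + c):           # same block for all time steps
            frame[ch] = 0.0
    return out, h, c
```

Because the same block is zeroed at every time step, the model cannot recover the missing channels from neighbouring frames and has to rely on the remaining channels.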
Masking policy 3: patch masking (PM). Often, music can be dynamic, quickly changing pitch, amplitude, and timbre. For this reason, it can be prohibitively difficult for a decoder to accurately reconstruct contiguous frames of features, particularly over long spans of music. Prior work in audio-based transformers has proposed patch masking 16 , which involves masking a square set of features (channels) and timesteps (frames). In the patch masking paradigm, squares of equal size are sampled with replacement until 15% of the input matrix is masked (see Fig. 1). We use this policy in comparison with a policy that uses CCM and CFM in tandem, which was found to be the best policy in a prior study 15 .
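Patch masking can be sketched in the same style. The patch side length is not stated in the text, so `side=4` below is purely an illustrative assumption, as is the choice to zero the masked cells. Since patches are sampled with replacement, they may overlap, and the final patch can overshoot the 15% budget slightly.

```python
import random


def patch_mask(matrix, side=4, budget=0.15, rng=None):
    """Zero square patches of a frames-by-channels matrix.

    Returns (masked copy, number of masked cells).
    """
    rng = rng or random.Random()
    n_frames, n_channels = len(matrix), len(matrix[0])
    target = budget * n_frames * n_channels
    masked = [[False] * n_channels for _ in range(n_frames)]
    count = 0
    while count < target:
        t0 = rng.randrange(n_frames - side + 1)     # top-left corner (time)
        c0 = rng.randrange(n_channels - side + 1)   # top-left corner (channel)
        for t in range(t0, t0 + side):
            for c in range(c0, c0 + side):
                if not masked[t][c]:                # count each cell once
                    masked[t][c] = True
                    count += 1
    out = [[0.0 if masked[t][c] else matrix[t][c]
            for c in range(n_channels)] for t in range(n_frames)]
    return out, count
```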
Pre-training objective function. We use the Huber loss 45 to minimize the reconstruction error between masked input features and the corresponding encoder output. Huber loss is a robust ℓ1 loss that is less sensitive to outliers 46. Additionally, a prior study 15 found that using Huber loss made training converge faster than ℓ1 loss.
Table 1. Acoustic features of music extracted by Librosa 41. We sought to use musical inputs that captured musical qualities such as timbre, melody, harmony, and spectrum (frequency-amplitude relationships).
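For reference, the standard Huber loss can be sketched as below. This is the textbook definition, not the paper's exact equation: the threshold `delta=1.0` and the choice to average only over masked frames are assumptions, and the function names are hypothetical.

```python
def huber(residual, delta=1.0):
    """Quadratic for small residuals, linear for large ones."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * (a - 0.5 * delta)


def masked_huber_loss(target, predicted, masked_indices, delta=1.0):
    """Mean Huber loss over all feature values of the masked frames.

    target, predicted: lists of per-frame feature vectors.
    masked_indices: indices of frames that were masked at the input.
    """
    total, count = 0.0, 0
    for i in masked_indices:
        for t, p in zip(target[i], predicted[i]):
            total += huber(p - t, delta)
            count += 1
    return total / count
```

The linear branch is what limits the influence of outliers: a residual of 3 contributes 2.5 under this loss, compared with 4.5 under half the squared error.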

M3BERT model parameters.
We report experimental results on two models: M3BERT-Small and M3BERT-Large. Model settings are listed in Table 2. The number of transformer block layers, the size of hidden vectors, and the number of self-attention heads are represented as $L_{num}$, $H_{dim}$, and $A_{num}$, respectively.

Methods
Dataset curation and preprocessing. As shown in Table 3 [...] different samples. If a song is more than 30 s long but less than 60 s long, it is split into two equal parts without overlap, which ensures that every example is at least 15 s long and no more than 30 s long. This allows for more pre-training examples, but introduces a potential bias: a long track may have more representation in the final embedding than a shorter song. As we have hundreds of thousands of training examples, we accept the risk of skewed representation.
The representations produced by the transformer are fine-tuned on five downstream tasks in tandem (see Figs. 3 and 4): the GTZAN music genre classification task 51, the MTG-Jamendo music auto-tagging task 49, the Real World Computing (RWC) Instrument Classification task 52, the Database for Emotional Analysis of Music (DEAM) task 53, and the Extended Ballroom task 54.
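The segment-splitting rule for pre-training tracks can be illustrated with a small sketch. Only the rule for tracks between 30 s and 60 s is stated explicitly in the text; the handling of longer tracks (peeling off 30 s chunks until less than 60 s remains) is an assumption made here so that every segment stays between 15 s and 30 s. The function name is hypothetical.

```python
def split_track(duration, max_len=30.0):
    """Return (start, end) times in seconds for the segments of one track."""
    segments = []
    start = 0.0
    remaining = float(duration)
    # Assumption: long tracks are cut into max_len chunks first.
    while remaining >= 2 * max_len:
        segments.append((start, start + max_len))
        start += max_len
        remaining -= max_len
    if remaining > max_len:
        # Stated rule: between 30 s and 60 s, split into two equal halves.
        half = remaining / 2.0
        segments.append((start, start + half))
        segments.append((start + half, start + remaining))
    elif remaining > 0:
        segments.append((start, start + remaining))
    return segments
```

For example, a 45 s track yields two 22.5 s segments, while a 20 s track is kept whole.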
GTZAN consists of 1000 music clips divided into ten different genres (blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae and rock). Each genre consists of 100 music clips in .wav format, each with a duration of 30 s.
The MTG-Jamendo task consists of over 18,000 music clips, each with at least one mood or theme label. These tags range from common ("Happy" and the thirteen other most common tags are present in 68% of examples) to uncommon (the "Sexy" tag is present in 0.64% of samples), and the imbalance factor (the count of the most common tag divided by the count of the least common tag) is 15.7.
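The imbalance factor defined above is simple to compute from a list of per-clip tag sets. The sketch below uses made-up toy tags for illustration only; the function name is hypothetical.

```python
from collections import Counter


def imbalance_factor(label_sets):
    """Count of the most common tag divided by count of the least common tag."""
    counts = Counter(tag for tags in label_sets for tag in tags)
    return max(counts.values()) / min(counts.values())


# Toy multi-label example (not real dataset statistics).
toy_labels = [["happy", "calm"], ["happy"], ["happy", "sexy"], ["calm"]]
print(imbalance_factor(toy_labels))
```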
The Extended Ballroom dataset is an augmented version of the Ballroom dataset 55 . This dataset contains 4,180 music clips divided into 13 genres representing various ballroom dances (Cha Cha, Jive, Quickstep, etc). As these genres are closely related to rhythmic patterns, they can also be considered as rhythm classes. This dataset's imbalance factor is also quite high, at 23 (Waltz is the most common label, and West Coast Swing is the least common). While other metadata is available (for example, artist and beats per minute of each song), we leave the possibility of leveraging such information for future work.
The RWC Musical Instrument Sound Database covers 50 musical instruments. At least three musicians played each instrument and at least three different manufacturers' models were used for each instrument. To further provide a wide variety of musical instrument performances, the dataset includes samples from every tonal and dynamic range of each instrument.
After breaking long songs into smaller 30s chunks, the DEAM dataset consisted of 2099 excerpts annotated for overall (per-excerpt) emotional valence and arousal. Each sample was appraised for (perceived) valence and arousal by at least five annotators, and triplet embeddings of these labels were computed as in other studies 56,57 .
For GTZAN, we used the fault-filtered splits given in other literature 58; for MTG-Jamendo, we organized the training, validation and testing sets as in previous literature as well 59. For all other datasets, we could not find an agreed-upon set of splits in prior work, so we split up our data randomly into five equal parts, using three parts for training, one part for validation, and one part for testing. We split these data sets into equal parts according [...] 60,61 were applied to the extracted features to minimize the distortion caused by noise contamination. Finally, these normalized features were concatenated to form a set of 324 features per frame, which was later used as the pre-training input of M3BERT.
Training setup. All of our experiments were conducted on two RTX 2080 Ti GPUs. In pre-training, M3BERT-Small and M3BERT-Large were trained with an effective batch size of 128 for 200k and 500k steps, respectively. We applied an Adam optimizer 62 with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-6}$. The learning rate followed a warmup schedule 17 according to the formula:
$$l_{rate} = \min\left(\frac{l_{max}\, s}{wT},\ \frac{l_{max}\,(T - s)}{T(1 - w)}\right)$$
where $s$ represents the step number, $w$ represents the warmup proportion (warmup covers 7% of the total steps $T$, i.e. $w = 0.07$), and $l_{max}$ represents the max learning rate (set to $2 \cdot 10^{-4}$). For downstream tasks, we performed a grid search on a set of parameters and the model that performed best on the validation set was selected (see Table 4). All other training parameters remained the same as those in the pre-training stage.
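The warmup schedule can be written out directly. The sketch below treats `w` as the warmup fraction (0.07), which is the reading under which the two branches of the formula meet at the peak learning rate; the function name is hypothetical.

```python
def learning_rate(step, total_steps, warmup_frac=0.07, lr_max=2e-4):
    """Linear warmup to lr_max, then linear decay to zero at total_steps."""
    warmup = lr_max * step / (warmup_frac * total_steps)
    decay = lr_max * (total_steps - step) / (total_steps * (1.0 - warmup_frac))
    return min(warmup, decay)


# Example: the 200k-step schedule peaks at step 14,000 (7% of 200,000).
for s in (0, 7_000, 14_000, 107_000, 200_000):
    print(s, learning_rate(s, 200_000))
```

The two branches are equal at `step = warmup_frac * total_steps`, so taking the minimum gives a continuous triangular schedule.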

Results
Patch masking, CFM and CCM. We first survey the difference between patch masking, CFM, and CCM.
When testing Patch Masking, CFM, and CCM individually on the MTG-Jamendo dataset, we find that Patch Masking outperforms the other two masking policies (Table 5). However, when CFM and CCM are combined, as was done in a similar study 15, the performance is better than with Patch Masking. A hybrid approach combining CCM, CFM, and Patch Masking simultaneously was not attempted because CCM and CFM already involve contiguous channel and frame masking. In subsequent results, we report on results that use CCM and CFM only. Experiments were conducted on the Jamendo dataset because it is the largest of the fine-tuning datasets and has canonical train-validation-test splits, allowing for seamless comparison to other approaches and masking policies 15.
Evaluation on downstream tasks. For each downstream task reported in the following sections, models using M2BERT and M3BERT embeddings were compared against models that use two commonly used general-purpose audio features: MFCCs and VGGish embeddings. We also compared our representations against a contrastive learning approach on music, as implemented in previous work on Contrastive Learning of Musical Representations (CLMR) 26. In addition, the state-of-the-art model performance using task-specific features and architectures is reported, if available.

GTZAN. The test accuracy on the GTZAN dataset using the fault-filtered splits is shown in Table 6. Although this small dataset is prone to overfitting 51, the multi-task paradigm does not bring our results close to the performance of the state-of-the-art model, which pretrains a CNN on MSD and then fine-tunes the entire network on GTZAN, therefore qualifying as a deep end-to-end model.

MTG-Jamendo emotions and themes in music.
For the Jamendo mood-theme auto-tagging task, macro-averaged ROC-AUC and PR-AUC were used to measure performance. ROC-AUC can lead to over-optimistic scores when data is imbalanced 65, and since the music tags given in the MTG-Jamendo dataset are highly imbalanced 66,67, we also used PR-AUC for evaluation. The M3BERT model was compared with other state-of-the-art models from MediaEval 2020: Emotion and Theme Recognition in Music Using Jamendo 59. We used the same train-validation-test data splits as the challenge. The results are shown in Table 7. For the baseline model (based on VGGish features 63) and the 2019 MediaEval winner 5, we directly used the evaluation results posted in the competition leaderboard. For the 2020 winner 66, we reproduced the work according to their implementation. This approach uses focal loss and CNNs to achieve state-of-the-art results. We applied only a simple time-distributed dense layer to the output representations from M3BERT; our results suggest that improvement over past state-of-the-art work on this music auto-tagging task may be possible if a back-end architecture that integrates information over the temporal domain, such as a CNN, were used instead.
Extended Ballroom genre classification dataset. For the Extended Ballroom genre classification task, our performance was compared against other models, although the splits were different. The best-performing approach in Table 8 that does not use deep learning suggests that rhythmic features are helpful in predicting ballroom music genres; such features were not used in our musical inputs. The best-performing approach used a CNN-based model for genre prediction.
DEAM music emotion recognition task. In the DEAM music emotion recognition task, our representations were compared against other feature sets, including VGGish features and MFCCs. In Table 9, we see that MFCCs perform poorly on this music emotion recognition task, while hand-crafted features and the more generalized VGGish features perform even better than our representations.

RWC instrument detection task.
In the RWC instrument classification task, our representations outperformed the other results found in the literature (see Table 10). Understandably, timbral MFCC features perform better than VGGish features on instrument detection. It is evident here that representations are enriched in the multi-task stage, as performance is better using M3BERT-Large than using M2BERT.
Ablation study. Ablation studies were conducted to better understand the performance of M3BERT, similar to the work done by Zhao and Guo 15 . The results are shown in Table 11.
We removed datasets from pre-training to assess which datasets were most crucial to good performance on downstream tasks. Removing any dataset from pre-training results in a degradation in downstream performance on MTG-Jamendo autotagging; the larger the input dataset, the more severe the degradation. The multi-faceted music (M2BERT) model uses the diverse input datasets to inform its representations, and each dataset is evidently bringing a rich set of features for informing pre-training. We also explore the effect that model size has on downstream task accuracy. In our experiments, M3BERT-Large generally outperforms M3BERT-Small, which remains consistent with the findings of Zhao and Guo 15, although in tasks like valence prediction we see that M3BERT-Small outperforms M3BERT-Large. For other tasks, [...]

Table 6. Results of a genre classification task on the GTZAN dataset. Approaches that use deep neural networks for prediction are italicized. Highest value is given in bold.
Table 7. Results of an auto-tagging task on the MTG-Jamendo dataset. Approaches that use deep neural networks for prediction are italicized. Highest values per metric are given in bold.
Table 8. Results of a genre classification task on the Extended Ballroom dataset. * indicates that the model evaluates on different subsets of the dataset than our work and hence numbers are not directly comparable. Approaches that use deep neural networks for prediction are italicized. Highest values per metric are given in bold.
Table 9. Results of a music emotion recognition task on the DEAM dataset. * indicates that the model evaluates on different subsets of the dataset than our work and hence numbers are not directly comparable. Highest values per metric are given in bold.

Correlational analysis. Deep learning models and feature sets alike often suffer from a lack of interpretability 74. In an effort to find representations of music that may be interpretable, we used Librosa 41 to compute several high-level audio features, including brightness, loudness, and spectral flux. We then correlated these features with outputs from the M3BERT encoder. Results and correlations are shown in Figs. 5 and 6. We posit that these output representations from M3BERT are both powerful and interpretable, adding to their utility for studying music-related tasks.

Figure 5. Centroid and cell activation. Certain outputs from the M3BERT encoder correlate highly with auditory phenomena, like spectral centroid. Pearson's ρ = 0.831 between these two features.
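The correlational analysis amounts to computing a Pearson coefficient between each encoder output dimension and a frame-level audio feature. A minimal sketch follows, assuming the encoder outputs and the audio feature are already aligned frame by frame; the function names and the toy numbers are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def most_correlated_unit(activations, feature):
    """Find the output dimension whose activation tracks the feature best.

    activations: list of per-frame encoder output vectors.
    feature: list of per-frame scalar values (e.g. spectral centroid).
    Returns (unit index, correlation), ranked by absolute correlation.
    """
    dim = len(activations[0])
    scores = [pearson([frame[d] for frame in activations], feature)
              for d in range(dim)]
    best = max(range(dim), key=lambda d: abs(scores[d]))
    return best, scores[best]


# Toy example: unit 0 rises linearly with the feature, unit 1 does not.
acts = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [4.0, 1.0]]
feat = [10.0, 20.0, 30.0, 40.0]
print(most_correlated_unit(acts, feat))
```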

Discussion
We see that on several different types of downstream tasks, such as instrument detection and mood-theme autotagging, M3BERT produces features that, when passed through a simple neural network, perform considerably better than other music features and, in the case of mood-theme autotagging, on par with the state-of-the-art model by the ROC-AUC metric. This makes M3BERT a useful first-stop-shop baseline for generating features for application to a diverse set of music-related tasks. We observe that M3BERT performs much better on the mood-theme classification task than the M2BERT model: this may be because the multi-task learning paradigm exploited some labels that were present in both the mood-theme detection task and the genre classification tasks. For example, one label in GTZAN is "jazz" and one label in MTG-Jamendo is "jazzy." Curiously, the genre classification tasks did not benefit as much from multi-task learning; these datasets are relatively small compared to MTG-Jamendo, so in the multi-task paradigm, their samples are likely getting overwhelmed by the prevalence of MTG-Jamendo samples. We observe that performance on tasks with the fewest training examples seems to degrade after multi-task training. While multi-task learning may not always improve the embeddings' performance, with multi-task-specific loss function adjustments, such as those suggested by Kendall et al. 75, it may be possible to improve on the results posted here.
In the classification and regression tasks, we averaged outputs across timesteps. This architecture was used for the sake of simplicity in creating representations of music, but it does not take advantage of the temporal dependencies of the musical inputs. If an architecture that captures this temporal information-such as a CNN or LSTM-were to be built upon the features that we created, we would expect to see greater improvement on these downstream tasks.
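Averaging outputs across timesteps is plain mean pooling, and a short sketch makes its limitation concrete: the pooled vector is identical for any reordering of the frames, so temporal structure is discarded. The function name is hypothetical.

```python
def mean_pool(frame_outputs):
    """Average per-frame output vectors into one clip-level vector."""
    n = len(frame_outputs)
    dim = len(frame_outputs[0])
    return [sum(frame[d] for frame in frame_outputs) / n for d in range(dim)]


clip = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
print(mean_pool(clip))                    # clip-level vector
print(mean_pool(list(reversed(clip))))    # same vector for the reversed clip
```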
We see that although M3BERT performed very well on the instrument classification task, it did not perform as well on the GTZAN genre classification or DEAM music emotion recognition tasks. This may also be explained by the relative paucity of data (the MTG-Jamendo dataset is 18 times larger than GTZAN) and the input features we used for pre-training, which may not have spanned feature types that would be relevant for these prediction tasks. To wit, we used many features that related to timbre, which sensibly would perform well on an instrument classification task, but may not necessarily perform well on a music emotion recognition task, for example. Similarly, rhythmic features are shown to be effective in ballroom dance genre classification 68 , but were not represented in our initial input features. From our results, we hypothesize that choosing a broad set of input audio features and balancing fine-tuning across large, diverse datasets are important for creating robust representations of music.
We also note that a contrastive learning approach to creating music representations performs well on the genre detection tasks, outperforming M3BERT representations on the GTZAN dataset and the Extended Ballroom dataset. However, these representations seem to fall short on other tasks, especially the tasks related to music emotion recognition and instrument detection. We hypothesize that augmentations used during pre-training (on Magnatagatune 76 ) do not translate well to music emotion recognition or instrument classification because positive pairs can have different arousal, valence, or sound quality, which could adversely affect embeddings used for related tasks.
In the interest of investigating interpretability of our embeddings, we present two high-level features that are highly correlated with outputs from M3BERT: harmonicity and spectral centroid. While centroid is a rough measure for a song's pitch, other frequency-based features were also correlated with cell activations, including brightness and spectral rolloff. Harmonicity and percussiveness were both correlated with encoder outputs (ρ > 0.8), and relate to timbre and, proximally, loudness (we did not analyze Root-Mean-Square of the waveform because it is captured in our encoder inputs by MFCC 0). Other features, including f0, spectral flatness and contrast, and zero crossing rate, were not found to be highly correlated with encoder outputs. These correlations suggest that certain base auditory features, like spectral centroid and harmonicity, are informative for a variety of music-related tasks; M3BERT may be used to uncover such features, providing MIR researchers additional insight into meaningful, interpretable features for tasks of interest.

Conclusion
We propose M3BERT, a universal music encoder based on transformers. Rather than relying on massive human-labeled data, which are expensive and time-consuming to collect, M3BERT can learn representations of music from unlabeled data and improve upon its representation with multi-task learning in fine-tuning. Contiguous Frame Masking, Contiguous Channel Masking, and Patch Masking are applied to the pre-training examples, and features are created in reconstruction from a BERT-like, self-supervised transformer model. Subsequently, using a multi-task approach, this model enriches its features in a supervised manner, learning from several disparate music information retrieval tasks at once. The effectiveness of different masking policies, datasets, and input features is evaluated through ablation studies. We find that M3BERT outperforms commonly used features for music classification on a variety of music-related tasks, such as instrument classification and mood-theme detection. We also find that multi-task learning tends to enrich the representations generated by our encoder. Our work shows the potential of adapting a transformer-based, masked reconstruction pre-training scheme with multi-task learning to MIR interests. Beyond improving the model, we plan to extend M3BERT to other music understanding tasks, like key estimation and cover song detection, all while managing dataset imbalance to ensure that multi-task enrichment does not favor tasks with more examples. This work shows that marrying